Abstract: In wireless ad hoc networks, autonomous nodes are
reluctant to forward others’ packets because of their limited
energy. However, such selfishness and noncooperation degrade
both the system efficiency and the nodes’ own performance. Moreover,
distributed nodes with only local information may not know
the cooperation point, even if they are willing to cooperate. Hence,
it is crucial to design a distributed mechanism for enforcing
and learning cooperation among greedy nodes in packet
forwarding. In this paper, we propose a self-learning repeated-game
framework to address this problem. We employ the self-transmission
efficiency as the utility function of each autonomous node. The
self-transmission efficiency is defined as the ratio of the power for
self packet transmission over the total power for self packet
transmission and packet forwarding. We then propose a framework to
search for good cooperation points and maintain cooperation among
selfish nodes. The framework has two steps. First, an adaptive
repeated game scheme is designed to ensure cooperation
among nodes for the current cooperative packet forwarding
probabilities. Second, self-learning algorithms are employed to
find better cooperation probabilities that are feasible and
benefit all nodes. We propose three learning schemes for different
information structures, namely, learning with perfect observability,
learning through flooding, and learning through utility
prediction. Starting from noncooperation, the two steps
are applied iteratively, so that better cooperation points are
reached and maintained in each iteration. Simulations show that
the proposed framework is able to enforce cooperation among
distributed selfish nodes and that the proposed learning schemes
achieve 70% to 98% of the performance of the optimal solution.

I.  Introduction 

Some  wireless  networks  such  as  ad-hoc  networks  consist 
of  autonomous  nodes  without  centralized  control.  In  such 
autonomous  networks,  the  nodes  may  not  be  willing  to  fully 
cooperate and accomplish the network task. Specifically, for
the packet forwarding problem, forwarding others’ packets
consumes the node’s limited battery resource. Therefore, it
may not be in the node’s best interest to forward others’
arriving packets. However, refusing to forward others’
packets will severely affect the network
functionality and impair the nodes’ own benefits. Hence, it is
crucial  to  design  a  mechanism  to  enforce  cooperation  among 
greedy  nodes.  In  addition,  the  randomly  located  nodes  with 
local  information  may  not  know  how  to  cooperate,  even  if 
they  are  willing  to  cooperate. 
The packet forwarding problem in ad hoc networks has been
extensively studied in the literature. The fact that nodes act
selfishly to optimize their own performance has motivated
many researchers to apply game theory [1], [2] to this
problem. Broadly speaking, the approaches used to
encourage packet forwarding fall into two types.
The first type makes use of virtual payment; virtual currency,
pricing, and credit-based methods [3], [4] belong to it.
The second type relies on personal and community enforcement
to maintain long-term relationships among nodes.
Cooperation is sustained because defection against one node
causes personal retaliation or sanction by others. This second
type includes the following works. Marti et al. [5] propose
mechanisms called watchdog and pathrater to identify
misbehaving nodes and deflect traffic around them. Buchegger
et al. [6] define protocols based on a reputation system. Altman
et al. [7] consider a punishment policy to show cooperation
among participating nodes. In [8], Han et al. propose learning
repeated-game approaches to enforce cooperation and obtain
better cooperation solutions. Other works that apply game
theory to communication problems can be found in
[9], [10], and [11].

In some wireless networks, it is difficult to implement
a virtual payment system because of practical
implementation challenges such as enormous signaling. In this
paper, we therefore concentrate on the second type of approaches and
design a mechanism such that cooperation can be enforced in
a distributed way. In addition, unlike previous works, which
assume that the nodes know the cooperation points or other nodes’
behaviors, we argue that randomly deployed nodes with local
information may not know how to cooperate even if they are
willing to do so. Motivated by these facts, we propose a
self-learning repeated-game framework for cooperation enforcing
and learning.

We define self-transmission as the transmission of a
user’s own packets. We quantify a node’s utility as its
self-transmission efficiency, which is defined as the ratio of the
power for successful self-transmission over the total power
used for self-transmission and packet forwarding. The goal
of each node is to maximize its long-term average efficiency.
Using this utility function, a distributed self-learning
repeated-game framework is proposed to ensure cooperation among
autonomous nodes. The framework consists of two steps: First,
the repeated game enforces cooperation in packet forwarding.

Fig. 1. Illustration of time-slotted transmission to two alternative stages

This first step ensures that any cooperation equilibrium that
is more efficient than the Nash Equilibrium (NE) of the
one stage game can be sustained. The repeated game allows
nodes to consider the history of actions/reactions of their
opponents in making their decisions. Cooperation can be
enforced/sustained using the repeated game, since any deviation
causes punishment from other nodes in the future.
The second step uses a learning algorithm to reach
the desired efficient cooperation equilibrium. We propose
three learning algorithms for different information structures,
namely, learning with perfect observability, learning through
flooding, and learning through utility prediction. Starting from
the non-cooperation point, the two proposed steps are applied
iteratively. A better cooperation point is discovered and maintained
in each iteration, until no more efficient cooperation point
can be achieved. The simulation results show that the proposed
framework is able to enforce cooperation among selfish nodes.
Moreover, compared to the optimal solution obtained by
a centralized system with global information, the proposed
learning algorithms achieve similar performance in the symmetric
network. Depending on the learning algorithm and the
information structure, the proposed schemes achieve
near-optimal solutions in the random network.

This paper is organized as follows. In Section II, we give
the system model and explain the design challenge. In Section
III, we propose and analyze the repeated-game framework for
packet forwarding under different information structures. In
Section IV, we construct in detail the self-learning algorithms
corresponding to the different information structures. In Section V,
we evaluate the performance of the proposed scheme using
extensive simulations. Finally, conclusions are drawn in
Section VI.

II.  System  Model  and  Design  Challenge 

We consider a network with $N$ nodes. Each node is battery-powered
and has a transmit power constraint. This implies that
only nodes within the transmission range are neighbors, so
packet delivery typically requires more than one hop. In each
hop, we assume that transmission occurs in a time-slotted manner,
as illustrated in Figure 1. The source, the relays (intermediate
nodes), and the destination constitute an active route. We
assume an end-to-end mechanism that enables a source node to
know whether a packet is delivered successfully. The source node
can observe whether there is a packet drop on one particular
active path. However, the source node may not know where
the packet is dropped. Finally, we assume that the routing decision
has already been made before optimizing the packet forwarding
probabilities¹.

¹ We note that it is always possible for nodes to manipulate the
routing layer. However, this is beyond the scope of this paper. For more
information, please refer to [16].


Let us denote the set of sources and destinations as $\{S_i, D_i\}$,
for $i = 1, 2, \ldots, M$, where $M$ represents the number
of source-destination pairs that are active in the network.
Suppose the shortest path for each source-destination pair
has been discovered. Let us denote the route/path as
$R_i = (S_i, f_{R_i}^1, f_{R_i}^2, \ldots, f_{R_i}^n, D_i)$, where $S_i$ denotes the source node,
$D_i$ denotes the destination node, and $\{f_{R_i}^1, \ldots, f_{R_i}^n\}$ is
the set of intermediate/relay nodes; thus, there are $n + 1$ hops
from the source node to the destination node. Let $V = \{R_i : i = 1, \ldots, M\}$
be the set of routes corresponding to all source-destination
pairs. Let us further denote the set of routes where
node $j$ is the source as $V_j^s = \{R_i : S(R_i) = j, i = 1, \ldots, M\}$,
where $S(R_i)$ represents the source of route $R_i$. The power
expended in node $i$ for transmitting its own packet is

$$P_s^{(i)} = \sum_{r \in V_i^s} \mu_{S(r)} \cdot K \cdot d(S(r), n(S(r), r))^{\gamma}, \tag{1}$$

where $\mu_{S(r)}$ is the transmission rate of source node $S(r)$, $K$
is the transmission constant, $d(i, j)$ is the distance between
node $i$ and node $j$, $n(i, r)$ denotes the neighbor of node $i$
on route $r$, and $\gamma$ is the transmission path-loss coefficient.
For the link from node $i$ to its next hop $n(i, r)$ on route $r$,
$K \cdot d(i, n(i, r))^{\gamma}$ describes the power per bit required for
reliable, successful transmission. We note that equation (1) can
also be interpreted as the average signal power required for
successful transmission at rate $\mu_{S(r)}$. This implies that
transmission failure due to channel fading has been
taken into account by the transmission constant $K$.

Let $\alpha_i$, for $i = 1, \ldots, N$, be the packet forwarding
probability of node $i$. Here, we use the same packet forwarding
probability for all source-destination pairs for the
following reasons. First, under the greedy assumption on
the nodes, there is no reason for one particular node to
forward some packets on some routes and refuse to forward
other packets on other routes. Second, using different
packet forwarding probabilities on different routes would only
complicate the detection of a node’s deviation and would not
change the optimization framework proposed in this paper.
So, as a first step in analyzing the problem, we assume the
same forwarding probability on every route. In future work,
we will explore the case where the nodes use different
packet forwarding probabilities for different routes.

Clearly, the probability of successful transmission from node
$i$ to its destination depends on the forwarding probabilities
employed at the intermediate nodes, and it can be represented
as

$$P_{tx,r}^{i} = \prod_{j \in (r \setminus \{S(r) = i, D(r)\})} \alpha_j, \tag{2}$$

where $D(r)$ is the destination of route $r$ and $(r \setminus \{S(r) = i, D(r)\})$
is the set of nodes on route $r$ excluding the source
and destination. Let us define the good power consumed in
transmission by node $i$, $P_{s,good}^{(i)}$, as the product of the power used
for transmitting node $i$’s own packet and the probability of
successful transmission from node $i$ to its destination,

$$P_{s,good}^{(i)} = \sum_{r \in V_i^s} \mu_{S(r)} \cdot K \cdot d(S(r), n(S(r), r))^{\gamma} \cdot P_{tx,r}^{i}. \tag{3}$$
Moreover, let the set of routes where node $j$ is a forwarding
node be $W_j$. The power used by node $i$ to forward others’ packets
is given by

$$P_f^{(i)} = \alpha_i \cdot K \cdot \sum_{r \in W_i} d(i, n(i, r))^{\gamma} \mu_{S(r)} \cdot P_{F,r}^{i}, \tag{4}$$

where $P_{F,r}^{i}$ is the probability that node $i$ receives the packet
to forward on route $r$, and $\sum_{r \in W_i} \mu_{S(r)} P_{F,r}^{i}$ is the total rate
that node $i$ receives for packet forwarding. The probability that
node $i$ receives the packet to forward on route $r$ is represented as

$$P_{F,r}^{i} = \prod_{j \in \{f_r^1, \ldots, f_r^{m-1}\}} \alpha_j, \tag{5}$$

where $r = (S(r), f_r^1, \ldots, f_r^{m-1}, f_r^m = i, \ldots, f_r^n, D(r))$ is
the $(n + 1)$-hop route from source $S(r)$ to destination $D(r)$,
and the $m$th forwarding node $f_r^m$ is node $i$. $P_{F,r}^{i}$ depends on
the packet forwarding probabilities of the nodes on route
$r$ before node $i$.

We refer to the task of transmitting the node’s own information
as self-transmission and the task of relaying others’
packets as packet forwarding. We focus on maximizing the
self-transmission efficiency, which is defined as the ratio of
successful self-transmission power (good power) over the
total power used for self-transmission and packet forwarding.
Therefore, the stage utility function for node $i$ can be
represented as

$$U^{(i)}(\alpha_i, \alpha_{-i}) = \frac{P_{s,good}^{(i)}}{P_s^{(i)} + P_f^{(i)}}, \tag{6}$$

where $\alpha_i$ is node $i$’s packet forwarding probability and
$\alpha_{-i} = (\alpha_1, \ldots, \alpha_{i-1}, \alpha_{i+1}, \ldots, \alpha_N)^T$ are the other nodes’
forwarding probabilities. Substituting (1), (3) and (4) into (6), we obtain
(7). Since the power for successful self-transmission depends
on the packet forwarding probabilities used by other nodes, the
self-transmission efficiency captures the trade-off between the
power a node uses to transmit its own information
and the power it uses to forward packets for the other nodes.
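
As a minimal sketch of how (1) to (6) combine, the following Python function evaluates the stage utility of one node from a list of routes. The names `stage_utility`, `routes`, `alpha`, `mu`, and `dist` are illustrative assumptions, not part of the paper: each route is assumed to be a tuple (source, relays..., destination), all sources are assumed to share the same rate `mu`, and `dist` is an assumed callable returning $d(i, j)$.

```python
from math import prod


def stage_utility(i, routes, alpha, mu=1.0, K=1.0, gamma=2.0,
                  dist=lambda a, b: 1.0):
    """Self-transmission efficiency of node i, following (1)-(6).

    routes: list of tuples (source, relay_1, ..., relay_n, destination).
    alpha:  dict mapping node -> packet forwarding probability in [0, 1].
    mu is one common rate for every source (an assumption for brevity).
    """
    p_self = 0.0  # (1): power spent on the node's own packets
    p_good = 0.0  # (3): part of that power that reaches the destination
    p_fwd = 0.0   # (4): power spent forwarding others' packets
    for r in routes:
        if r[0] == i:
            hop_power = mu * K * dist(r[0], r[1]) ** gamma
            p_self += hop_power
            # (2): every relay on the route must forward the packet
            p_good += hop_power * prod(alpha[j] for j in r[1:-1])
        elif i in r[1:-1]:
            m = r.index(i)
            # (5): the packet reaches node i only if all earlier relays forward
            p_arrive = prod(alpha[j] for j in r[1:m])
            p_fwd += alpha[i] * K * dist(i, r[m + 1]) ** gamma * mu * p_arrive
    total = p_self + p_fwd
    return p_good / total if total > 0 else 0.0
```

With unit rates and distances, a node that is the source of two two-hop routes and a relay on two others obtains 0.5 when every node forwards with probability 1, and 1.0 if it alone stops forwarding.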

The problem in packet forwarding arises because
autonomous nodes, such as those in ad hoc networks, have their own
authority to decide whether to forward incoming packets.
Under this scenario, it is natural to assume that each
node selfishly optimizes its own utility function. According to
(7), node $i$ selects $\alpha_i$ in order to maximize the transmission
efficiency $U^{(i)}(\alpha_i, \alpha_{-i})$. This implies that node $i$ will selfishly
minimize $P_f^{(i)}$, the portion of energy used to forward others’
packets. In the game theory literature [1], [2], the Nash
Equilibrium (NE) is a well-known concept, which states that in
equilibrium every node selects the best response strategy to the
other nodes’ strategies. The formal definition of NE is given
as follows.

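Definition 1: Define the feasible range $\Omega$ as $[0, 1]$. The Nash
Equilibrium $[\alpha_1^*, \ldots, \alpha_N^*]^T$ is defined as:

$$U^{(i)}(\alpha_i^*, \alpha_{-i}^*) \geq U^{(i)}(\alpha_i, \alpha_{-i}^*), \quad \forall i, \ \forall \alpha_i \in \Omega. \tag{8}$$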


Unfortunately, the NE for the packet forwarding game
described in (7) is $\alpha_i^* = 0, \forall i$. This can be verified by finding the
forwarding probability $\alpha_i \in [0, 1]$ such that $U^{(i)}$ is unilaterally
maximized. To maximize its transmission efficiency, node $i$
can only make the forwarding energy $P_f^{(i)}$ as
small as possible. This is equivalent to setting $\alpha_i$ as small
as possible, since the success probability of its own packet
transmission in (2) depends only on the other nodes’ willingness
to forward the packets. By greedily lowering its packet
forwarding probability, node $i$ reduces the total transmission
power used for forwarding others’ packets and therefore increases
its instantaneous efficiency. However, if all nodes play the
same strategy, the efficiency of every node drops to zero, i.e.,
$U^{(i)}(\alpha_1^*, \ldots, \alpha_N^*) = 0, \forall i$. As a result, the network breaks
down. Hence, playing the NE is inefficient not only from the
network point of view but also for each individual’s own benefit.
It is important to emphasize that the inefficiency of the NE is
independent of the utility function in (7). This inefficiency is
merely the result of the greedy optimization done unilaterally by
each of the nodes. In the next two sections, we propose a
self-learning repeated-game framework and show how cooperation
can be enforced using the proposed scheme.
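
The best-response argument above can be checked numerically. The sketch below is only an illustration under stated assumptions: the helper name `best_response` and the numeric values are hypothetical, and the utility is written in the generic form good power / (self power + $\alpha_i$ × forwarding cost), in which the numerator does not depend on the node's own $\alpha_i$.

```python
def best_response(p_good, p_self, fwd_cost, grid=101):
    """Return the forwarding probability in [0, 1] that maximizes
    p_good / (p_self + a * fwd_cost) over an evenly spaced grid.

    p_good, p_self and fwd_cost are fixed by the other nodes' choices,
    so only the denominator depends on the node's own probability a.
    """
    candidates = [k / (grid - 1) for k in range(grid)]
    return max(candidates, key=lambda a: p_good / (p_self + a * fwd_cost))


# Hypothetical values: whatever the others do, the maximizer is a = 0.
selfish_choice = best_response(p_good=1.5, p_self=2.0, fwd_cost=2.0)
```

Because the utility is decreasing in the node's own forwarding probability, the grid search returns 0 for any positive forwarding cost, which is the one stage NE described above.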

III.  Repeated-Game  Framework  and  Punishment 
Analysis 

As demonstrated in Section II, the packet forwarding game
has $\alpha_i^* = 0, \forall i$ as its unique Nash equilibrium if the game is
played only once. This implies that no node in the network
will cooperate in forwarding packets. In practice,
nodes typically participate in the packet forwarding game for
a certain duration of time, which is more suitably modeled
as a repeated game (a game that is played multiple times).
If the game never ends, it is called an infinitely repeated game,
which we use in this paper. In fact, the repeated game
need not be infinite. The important point is that
the nodes/players do not know when the game ends. In this
sense, the properties of the infinitely repeated game remain
valid. In this paper, we employ the normalized average
discounted utility with discount factor $\delta$ given by:

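$$U_{\infty}^{(i)} = \lim_{t' \to \infty} U_{t'}^{(i)} = (1 - \delta) \sum_{t=1}^{\infty} \delta^{t-1} U^{(i)}(\alpha(t)), \tag{9}$$

where $\alpha(t) = (\alpha_1, \ldots, \alpha_N)^T$, $U^{(i)}(\alpha(t))$ is the utility of node
$i$ at each stage game (7) played at time $t$, and $U_{t'}^{(i)}$ is the
normalized average discounted utility from time 1 to time $t'$.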
Unlike  the  one-time  game,  the  repeated  game  allows  a  strategy 
to  be  contingent  on  the  past  moves  and  results  in  the  reputation 
and  retribution  effects,  so  that  cooperation  can  be  sustained 
[2],  [13],  [14].  We  also  note  that  the  utilities  in  (7)  and  (9) 
are  indeed  heterogeneous  in  the  sense  that  they  carry  the 
information  about  the  channel,  routing,  and  node  behaviors.  In 
other  words,  the  utility  functions  in  (7)  and  (9)  reflect  different 
energy  consumption  according  to  different  distance,  rate,  and 
route  between  nodes. 
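
A finite-horizon version of the normalized discounted utility in (9) can be sketched in a few lines of Python. The function name is an assumption, and the stage-$t$ weight is taken to be $\delta^{t-1}$, which is consistent with the 6-stage value of 0.2343 reported for the example with $\delta = 0.9$ in Section III-A.

```python
def normalized_discounted_utility(stage_utils, delta):
    """(1 - delta) * sum over t = 1..t' of delta**(t - 1) * U(t).

    stage_utils[0] is the stage utility at t = 1, so enumerate() starting
    at 0 supplies the exponent t - 1 directly.
    """
    return (1.0 - delta) * sum(delta ** k * u for k, u in enumerate(stage_utils))


# Six cooperative rounds with stage utility 0.5 and delta = 0.9.
six_stage = normalized_discounted_utility([0.5] * 6, 0.9)
```

For a constant stage utility $u$ over $t'$ rounds the sum collapses to $u(1 - \delta^{t'})$, which approaches $u$ as $t' \to \infty$; this is why the factor $(1 - \delta)$ is called a normalization.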


Recall from Definition 1 that, given that all nodes play the NE, no node can improve
its utility by unilaterally changing its own packet forwarding
probability; here $\alpha_{-i}^* = (\alpha_1^*, \ldots, \alpha_{i-1}^*, \alpha_{i+1}^*, \ldots, \alpha_N^*)^T$.
The stage utility (7), obtained by substituting (1), (3), and (4) into (6), is

$$U^{(i)} = \frac{\sum_{r \in V_i^s} \mu_{S(r)} \, d(S(r), n(S(r), r))^{\gamma} \prod_{j \in (r \setminus \{S(r) = i, D(r)\})} \alpha_j}{\sum_{r \in V_i^s} \mu_{S(r)} \, d(S(r), n(S(r), r))^{\gamma} + \alpha_i \sum_{r \in W_i} d(i, n(i, r))^{\gamma} \mu_{S(r)} \prod_{j \in \{f_r^1, \ldots, f_r^{m-1}\}} \alpha_j}. \tag{7}$$

A. Design of Punishment Scheme under Perfect Observability

In this subsection, we analyze a class of punishment policies
under the assumption of perfect observability. Perfect
observability means that each node is able to observe the actions taken
by other nodes along the history of the game. This implies that
each node knows which node drops the packet and is aware of the
identity of the other nodes. This condition allows every node to
detect any defection by other nodes, and it also allows nodes to
know if any node does not follow the game rule. Perfect
observability is the ideal case and serves as the performance
upper bound. In the next subsection, this assumption is relaxed
to a more practical situation, where an individual node only
has limited local information.

Let us denote the NE in the one stage forwarding game as
$\alpha^* = (\alpha_1^*, \ldots, \alpha_N^*)^T$, and the corresponding utilities
as $(v_1^*, \ldots, v_N^*)^T = (U^{(1)}(\alpha^*), \ldots, U^{(N)}(\alpha^*))^T$. We also
denote

$$U = \{(v_1, \ldots, v_N) \mid \exists \alpha \in [0, 1]^N \ \text{s.t.} \ (v_1, \ldots, v_N) = (U^{(1)}(\alpha), \ldots, U^{(N)}(\alpha))\}, \tag{10}$$

$$V = \text{convex hull of } U, \tag{11}$$

$$V^{\dagger} = \{(v_1, \ldots, v_N) \in V \mid v_i > v_i^*, \ \forall i\}. \tag{12}$$

We note that $V$ consists of all feasible utilities, and $V^{\dagger}$
consists of the feasible utilities that Pareto-dominate the one stage
NE; this set is also known as the individually rational utility
set [1], [2]. The Pareto-dominant utilities are all utilities
that are strictly better than the one stage NE. From the game
theory literature [2], [13], [14], the existence of equilibria that
Pareto-dominate the one stage NE is given by the Folk theorem
[14].

Theorem 1 (Folk Theorem [14]): Assume that the
dimensionality of $V^{\dagger}$ equals $N$. Then, for any $(v_1, \ldots, v_N)$ in $V^{\dagger}$,
there exists $\underline{\delta} \in (0, 1)$ such that for all $\delta \in (\underline{\delta}, 1)$, there exists
an equilibrium of the infinitely repeated game with discount
factor $\delta$ in which player $i$’s average utility is $v_i$.

Before we give the application of the Folk theorem to the
packet forwarding game, it is useful to recall the notion
of a dependency graph. Given the routing algorithm and the
source-destination pairs, the dependency graph is the directed
graph constructed as follows. The number of nodes in
the dependency graph is the same as the number of nodes
in the network. When node $i$ sends packets to node $j$ via
nodes $f^1, \ldots, f^n$, there exist directed edges from node
$i$ to nodes $f^1, \ldots, f^n$. The resulting dependency graph
describes the node dependency in
performing the packet forwarding task. Let us define $\deg_{in}(i)$ and
$\deg_{out}(i)$ as the number of edges going into node $i$ and coming
out of node $i$, respectively. Thus, $\deg_{in}(i)$ is
the number of nodes whose packets are forwarded by node
$i$, and $\deg_{out}(i)$ is the number of nodes that help forward
node $i$’s packets. Using the notation of the corresponding
dependency graph, the application of the Folk theorem to the packet
forwarding game is stated as follows:

Theorem 2: (Existence of Pareto-dominant forwarding
equilibria under perfect observability)

Under the following conditions

1. the game is perfectly observable;

2. the corresponding dependency graph satisfies the
condition

$$\deg_{out}(i) > 0, \quad \forall i, \tag{13}$$

3. $V^{\dagger}$ has full dimensionality ($V^{\dagger}$ has dimensionality
of $N$). We note that $V^{\dagger}$ having dimensionality $N$
implies that the space formed by all points in $V^{\dagger}$
has dimensionality $N$.

Then, for any $(v_1, \ldots, v_N) \in V^{\dagger}$, there exists $\underline{\delta} \in (0, 1)$,
such that for all $\delta \in (\underline{\delta}, 1)$, there exists an equilibrium of the
infinitely repeated game with node $i$’s average utility $v_i$.

Proof: Let $\alpha = (\alpha_1, \ldots, \alpha_N)^T$ be the joint strategy that
results in $(U^{(1)}(\alpha), \ldots, U^{(N)}(\alpha))$. The full dimensionality
condition ensures that the point $(U^{(1)}(\alpha), \ldots, U^{(j-1)}(\alpha), U^{(j)}(\alpha) - \epsilon,
U^{(j+1)}(\alpha), \ldots, U^{(N)}(\alpha))$, for any $\epsilon > 0$, is in $V^{\dagger}$. Let
node $i$’s maximum utility be $\bar{v}_i = \max_{\alpha} U^{(i)}(\alpha), \forall i$. This
maximum utility is obtained when all nodes try to maximize
node $i$’s utility. Let the cooperating utility be
$v_i = U^{(i)}(\alpha), \forall i$, with $(v_1, \ldots, v_N) \in V^{\dagger}$. The cooperating utilities are obtained when
all nodes play the agreed packet forwarding probabilities. Let
the maximum utility node $i$ can get when it is punished be
$\underline{v}_i = \max_{\alpha_i} \min_{\alpha_{-i}} U^{(i)}(\alpha)$. Let us denote node $j$’s utility
when punishing node $i$ as $w_j^i$. We note that from (7), the
max-min utility $\underline{v}_i$ coincides with the one stage NE. If there
exist $\epsilon$ and a punishment period $T_i$ for node $i$ such that

$$\frac{\bar{v}_i}{U^{(i)}(\alpha) - \epsilon} < 1 + T_i, \tag{14}$$

then the following rules ensure that any individually rational
utilities can be enforced.

1) Condition I: All nodes play cooperation strategies
if there is no deviation in the last stages. After any
deviation, go to Condition II (suppose node $j$ deviates).

2) Condition II: Nodes that can punish the deviating node
(node $j$) play the punishing strategies for the punishment
period. The rest of the nodes keep playing cooperating
strategies. If there is any deviation in Condition II, restart
Condition II and punish the deviating node. If any
punishing node does not play punishment in the punishment
period, the other nodes will punish that particular node
during the punishment period. Otherwise, after the end
of the punishment period, go to Condition III.

3) Condition III: Play the strategy that results in utility
$(U^{(1)}, \ldots, U^{(j-1)}, U^{(j)} - \epsilon, U^{(j+1)}, \ldots, U^{(N)})$. If
there is any deviation in Condition III, start Condition
II and punish the deviating node.
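
The transitions among Conditions I to III can be sketched as a small state machine. This is only an illustrative reading of the rules above: the names `Phase` and `next_phase` are assumptions, and the sketch tracks only which condition is active and who is being punished, not the forwarding probabilities played in each condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Phase:
    name: str = "I"               # "I", "II", or "III"
    target: Optional[int] = None  # node currently (or most recently) punished
    remaining: int = 0            # punishment rounds still to be played


def next_phase(phase: Phase, deviator: Optional[int], T: int) -> Phase:
    """Advance the rule set by one round.

    deviator is the node observed deviating in the round just played
    (including a punisher that failed to punish), or None.
    """
    if deviator is not None:
        # Any deviation (in I, II, or III) starts or restarts Condition II.
        return Phase("II", deviator, T)
    if phase.name == "II":
        if phase.remaining > 1:
            return Phase("II", phase.target, phase.remaining - 1)
        # The punishment period is over: move on to Condition III.
        return Phase("III", phase.target, 0)
    return phase
```

Starting from `Phase()` with $T = 2$, a deviation by node 1 yields two rounds in Condition II followed by Condition III, and a fresh deviation in Condition III returns the system to Condition II.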

First, the cooperating strategy is the strategy that all nodes
agree upon. In contrast, the strategy for punishing node $i$ is
the strategy that results in the max-min utility at node $i$,
$\underline{v}_i = \max_{\alpha_i} \min_{\alpha_{-i}} U^{(i)}(\alpha)$. In the sequel, we show that under the
theorem’s assumptions:
• the average efficiency gained by the deviating node is
smaller than the cooperating efficiency,

• the average efficiency gained by a punishing node that
does not play the punishment strategy in the punishment
stage is worse than the efficiency gained by that node
when it conforms to the punishing strategy.

If node $j$ deviates in Condition I and then conforms, it
receives at most $\bar{v}_j$ when it deviates, $\underline{v}_j$ for $T_j$ periods
when it is punished, and $(U^{(j)} - \epsilon)$ after it conforms to the
cooperative strategy. The average discounted deviation utility
can be expressed as:

$$U_d^{(j)} = \bar{v}_j + \frac{\delta (1 - \delta^{T_j})}{1 - \delta} \underline{v}_j + \frac{\delta^{T_j + 1}}{1 - \delta} (U^{(j)} - \epsilon). \tag{15}$$

If the node conforms throughout the game, it has the
average discounted utility $\frac{1}{1 - \delta} U^{(j)}$. So the gain of deviation
is given by:

$$\Delta U^{(j)} = U_d^{(j)} - \frac{1}{1 - \delta} U^{(j)} < \bar{v}_j + \frac{\delta (1 - \delta^{T_j})}{1 - \delta} \underline{v}_j - \frac{1 - \delta^{T_j + 1}}{1 - \delta} (U^{(j)} - \epsilon). \tag{16}$$

We note that $\underline{v}_j$ coincides with the one stage NE, which is
$\underline{v}_j = 0, \forall j$. As $\delta \to 1$, $\frac{1 - \delta^{T_j + 1}}{1 - \delta}$ tends to $1 + T_j$. Under the
condition of (14), the deviation gain in (16) will be strictly
less than zero. This indicates that the average cooperating
efficiency is strictly larger than the deviation efficiency. Hence,
any rational node will not deviate from the cooperation point.
If the punished node still deviates in the punishment period,
the punishment period (Condition II) restarts and the punishment
duration experienced by the punished node is lengthened.
As a result, deviation in the punishment period postpones
the punished node from receiving the strictly better utility
$(U^{(j)} - \epsilon)$ in Condition III. Hence, it is better not to deviate
in the punishment stage.

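On the other hand, if punishing node $i$ does not play the
punishing strategy during the punishment of node $j$, node $i$
receives at most

$$U_d^{(i)} = \bar{v}_i + \frac{\delta (1 - \delta^{T})}{1 - \delta} \underline{v}_i + \frac{\delta^{T + 1}}{1 - \delta} (U^{(i)} - \epsilon). \tag{17}$$

However, if node $i$ conforms with the punishment strategy, it
will receive at least

$$U_p^{(i)} = \frac{\delta (1 - \delta^{T})}{1 - \delta} w_i^j + \frac{\delta^{T + 1}}{1 - \delta} U^{(i)}. \tag{18}$$

Here $w_i^j$ is the utility of node $i$ when punishing node $j$. Therefore,
node $i$’s reward for carrying out the punishment is (18) minus
(17),

$$U_p^{(i)} - U_d^{(i)} = \frac{\delta (1 - \delta^{T})}{1 - \delta} (w_i^j - \underline{v}_i) - \bar{v}_i + \frac{\delta^{T + 1}}{1 - \delta} \epsilon. \tag{19}$$

Using $\underline{v}_i = 0, \forall i$ and letting $\delta \to 1$, the expression (19) is
equivalent to

$$U_p^{(i)} - U_d^{(i)} = T \cdot w_i^j - \bar{v}_i + \frac{\epsilon}{1 - \delta}. \tag{20}$$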

By selecting $\delta$ close to one, this expression can always be made
larger than zero. As a result, the punishing node always
conforms to the punishment strategy in the punishment stage.


Fig. 2. Example of the punishment scheme under perfect observability.
Original graph with six nodes. Traffic flows: 1→2→3, 3→4→5, 5→6→1,
1→6→5, 3→2→1, 5→4→3, 2→3→4, 4→5→6, 6→1→2, 2→1→6, 4→3→2, 6→5→4.
Stage utility: $U^{(i)}(t) = \frac{\alpha_{\mathrm{mod}(i-2,6)+1}(t) + \alpha_{\mathrm{mod}(i,6)+1}(t)}{2 + 2\alpha_i(t)}$.


The same argument that no node deviates in Condition I
can be used to show that no node deviates in Condition III.
Therefore, we conclude that deviations in all Conditions are
not profitable. ■

The proof above is based on two conditions. First, the proof
assumes that there always exist nodes that can punish the
deviating node; this is guaranteed by the assumption $\deg_{out}(i) > 0$
in the corresponding dependency graph. Second, nodes are
able to identify which node is defecting and which node does
not carry out the punishment. This is guaranteed by the perfect
observability assumption. The strategy of punishing those who
misbehave and those who do not punish the misbehaving nodes
can be an effective strategy to cope with collusion attacks.
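
Whether condition (13) holds for a given routing can be checked by building the dependency graph explicitly. The following is a minimal sketch under stated assumptions: routes are given as tuples (source, relays..., destination), the function names are illustrative, and parallel edges between the same pair of nodes are merged.

```python
from collections import defaultdict


def dependency_degrees(routes):
    """Return (deg_in, deg_out) of the dependency graph built from routes.

    An edge source -> relay is added for every relay on each route;
    repeated edges between the same pair are counted once (an assumption).
    """
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for r in routes:
        for relay in r[1:-1]:
            out_edges[r[0]].add(relay)
            in_edges[relay].add(r[0])
    nodes = {n for r in routes for n in r}
    deg_in = {n: len(in_edges[n]) for n in nodes}
    deg_out = {n: len(out_edges[n]) for n in nodes}
    return deg_in, deg_out


def every_node_punishable(routes):
    """Condition (13): every node relies on at least one relay."""
    _, deg_out = dependency_degrees(routes)
    return all(d > 0 for d in deg_out.values())
```

A node whose traffic needs no relay has $\deg_{out} = 0$ and cannot be punished by dropped packets, which is exactly the case that condition (13) excludes.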

Now let us consider the following example to understand the
punishment behavior. We assume $\mu_{S(r)} = 1$, $K = 1$, and
$d(i, j) = 1$. The resulting utilities are shown in Figure 2. Each
node has the one stage utility

$$U^{(i)} = \frac{\alpha_{\mathrm{mod}(i-2,6)+1} + \alpha_{\mathrm{mod}(i,6)+1}}{2 + 2\alpha_i}.$$

With the discount factor $\delta = 0.9$ and the punishment period $T = 2$,
all nodes are better off when they
cooperate in packet forwarding by setting $\alpha_i = 1, \forall i$. If all nodes
conform to the cooperative strategies, the 6-stage normalized
average discounted utilities defined in (9) are given by
$U_6^{(i)} = 0.2343, \forall i$. In Figure 2, we plot the utility functions and
forwarding probabilities of all nodes. The x-axis of the plot
denotes the round of the game, the left y-axis denotes the value
of the node’s utility, and the right y-axis denotes the value of
the forwarding probability. The forwarding probability is denoted
by the plot with squares and the utility function is denoted by the
plot with stars.

In Figure 2, node 1 deviates in the second round of the game by setting its
forwarding probability to zero. At this time, node 1's utility changes from
0.5 to 1, as seen in the figure. As a consequence, node 2 and node 6 punish
node 1 during the following $T = 2$ stages by setting their forwarding
probabilities to zero. In the third round of the game, node 1 has to return
to cooperation. Otherwise, the punishment from the others restarts and
consequently its average discounted utility is lowered further. After the
punishment, all nodes return to the cooperative forwarding probabilities (as
shown in the figure). The resulting 6-stage normalized average utilities are
$U_6^{(1)} = 0.2023$, $U_6^{(2)} = U_6^{(6)} = 0.2887$,
$U_6^{(3)} = U_6^{(5)} = 0.1958$, and $U_6^{(4)} = 0.2343$. So node 1
obtains less utility by deviating than by cooperating. Moreover, if both
node 2 and node 6 fail to punish node 1, they will be punished by other
nodes during the following $T$ periods of the game. The resulting normalized
average utilities are $U_6^{(1)} = 0.3485$, $U_6^{(2)} = U_6^{(6)} = 0.1425$,
$U_6^{(3)} = U_6^{(5)} = 0.3035$, and $U_6^{(4)} = 0.165$. Therefore, node 2
and node 6 will carry out the punishment, since otherwise they will in turn
be punished and obtain less utility. The same argument prevents nodes from
deviating from the punishment strategy. We note that in this example the
corresponding dependency graph has $\deg_{in}(i) = \deg_{out}(i) = 2,
\forall i$. Therefore, punishing nodes are always available whenever any
node deviates.
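
The numbers in this example can be checked with a short simulation. The sketch below is illustrative and is not the authors' code. It assumes the one-stage utility $(\alpha_{i-1} + \alpha_{i+1}) / (2(1+\alpha_i))$ on the 6-node ring (indices wrap around) and takes the normalized average discounted utility to be $(1-\delta)\sum_{t=0}^{5}\delta^t U_t^{(i)}$; both forms are assumptions, chosen because they reproduce the values reported above. All function and variable names are hypothetical.

```python
N, DELTA, STAGES = 6, 0.9, 6

def stage_utility(alpha, i):
    """Assumed one-stage utility of node i (0-based index) on the 6-node ring."""
    left, right = alpha[(i - 1) % N], alpha[(i + 1) % N]
    return (left + right) / (2.0 * (1.0 + alpha[i]))

def normalized_utility(profiles, i, delta=DELTA):
    """(1 - delta) * sum_t delta^t * U_t for node i over a list of stage profiles."""
    return (1.0 - delta) * sum(
        delta ** t * stage_utility(alpha, i) for t, alpha in enumerate(profiles)
    )

coop = [1.0] * N

# Scenario 1: every node cooperates in all six rounds.
all_coop = [coop] * STAGES

# Scenario 2: node 1 (index 0) deviates in round 2; nodes 2 and 6
# (indices 1 and 5) punish it for T = 2 rounds; then everyone cooperates.
deviate = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
punish_node_1 = [1.0, 0.0, 1.0, 1.0, 1.0, 0.0]
punished = [coop, deviate, punish_node_1, punish_node_1, coop, coop]

# Scenario 3: nodes 2 and 6 skip the punishment in round 3, so their
# neighbours (nodes 1, 3 and 5) punish them in rounds 4 and 5.
punish_2_and_6 = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
no_punish = [coop, deviate, coop, punish_2_and_6, punish_2_and_6, coop]

for name, profiles in [("cooperate", all_coop),
                       ("punish", punished),
                       ("fail to punish", no_punish)]:
    print(name, [round(normalized_utility(profiles, i), 4) for i in range(N)])
```

Under these assumptions the three printed rows correspond to the cooperative utilities, the utilities after node 1 is punished, and the utilities when nodes 2 and 6 fail to punish.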

Finally, we discuss the discount factor $\delta$, which represents the
importance of the future. When the discount factor is small, the future
matters less. This causes the pathological situation in which the
instantaneous deviation gain of the defecting node exceeds any future
punishment by the other nodes. The node is then better off deviating than
cooperating, and it becomes very hard (if not impossible) to encourage all
nodes to cooperate. We also note that selfish nodes are better off choosing
$\delta$ close to one. If a node chooses $\delta$ close to zero, the future
is not important to it, so the node will ask other nodes to forward its own
packets at the very beginning of the game and stop forwarding others'
packets afterward. This invokes punishment from its neighboring nodes, which
stop forwarding that node's packets, so the node is effectively excluded
from the network. Therefore, it is better for nodes in the network to choose
$\delta$ approaching one.
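
As a rough illustration of how small is too small, take the ring example above with $T = 2$. The calculation below is only a sketch under assumed stage utilities read off that example (0.5 under cooperation, 1 in the deviation stage, 0 in each punishment stage); it is not part of the original analysis. Deviation is unprofitable when the one-stage gain does not exceed the discounted loss over the punishment stages:

```latex
\underbrace{1 - 0.5}_{\text{gain at deviation}}
\;\le\;
\underbrace{(0.5 - 0)\,(\delta + \delta^{2})}_{\text{loss over } T = 2 \text{ punishment stages}}
\;\Longleftrightarrow\;
\delta^{2} + \delta - 1 \ge 0
\;\Longleftrightarrow\;
\delta \ge \frac{\sqrt{5} - 1}{2} \approx 0.618 .
```

Under these assumptions, $\delta = 0.9$ comfortably satisfies the condition, whereas a node with $\delta$ below roughly 0.62 would prefer to deviate.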

B. Design of Punishment Scheme under Imperfect Local Observability

We have shown that, under the perfect observability assumption, the packet
forwarding game along with the punishment scheme can achieve any
Pareto-dominant efficiency. However, perfect observability may be difficult
to implement in ad hoc networks because of the enormous overhead and
signaling it requires. Therefore, we relax the perfect observability
condition in this subsection. Removing the assumption raises several
difficulties. Suppose each node observes only its own history of stage
utilities. In this situation, the node knows nothing about what has been
going on in the rest of the network. It can detect only the deviation of
nodes on which it relies for packet forwarding, and it cannot detect a
deviation in another part of the network, even though it may be the one that
can punish the deviating node. Therefore, it is impossible to implement the
Folk Theorem in this information-limited situation. Moreover, nodes may not
know whether the system is in a punishment stage. As soon as one of the
nodes sees a deviation, it starts the punishment period. This quickly
triggers another punishment stage by other nodes, since nodes cannot tell
whether a change in stage efficiency is caused by a punishment stage or by a
deviating node. As a result, defection spreads like an epidemic and
cooperation in the whole network breaks down. This is known as the
contagious equilibrium [13]. Indeed, the only equilibrium in this situation
is the one-stage NE.

The main cause of the contagious equilibrium is that nodes hold inconsistent
beliefs about the state of the system: they do not know whether the system
is currently in the punishment stage, the deviation state, or the end of the
punishment stage. Therefore, any mistake in invoking the punishment stage
can cause the contagious equilibrium. The lack of consistent knowledge of
the system state can be mitigated by communication between nodes. Suppose
each node observes only a subset of the other nodes' behaviors.
Communication is introduced by assuming that each node makes a public
announcement about the behaviors of the nodes it observes. This public
announcement can be implemented by having the nodes broadcast the observed
behaviors to one another. The intersection of these announcements can be
used to identify the deviating node. At the end of each stage game, the
nodes report either that no node deviated or the identity of the deviating
node. Since these announcements can be exchanged at a relatively low
frequency and only among the related nodes, the communication overhead is
limited. Under this local observability assumption, we propose the following
theorem, inspired by the Folk Theorem for private monitoring with
communication [15].

Theorem 3: Suppose $V^{\dagger}$ has dimensionality $N$ (full
dimensionality), where $N$ is the number of nodes in the network. Suppose
every node $i$ is monitored by at least two other nodes, which implies the
following:

1. If node $i$ participates only in routes that have 2 hops, then
$\deg_{in}(i) \ge 2$ is sufficient.

2. If node $i$ participates in routes of which one has only 2 hops, then
$\deg_{in}(i) \ge 3$ is sufficient.

3. If node $i$ participates in routes that have more than 2 hops, then
$\deg_{in}(i) \ge 4$ is sufficient.

Also, there always exists a node that can punish the deviating node, i.e.,

$$\deg_{out}(i) > 0, \quad \forall i. \qquad (22)$$

Moreover, the monitoring nodes can exchange their observations. Then, for
every $v$ in the interior of $V^{\dagger}$, there exists
$\underline{\delta} \in (0,1)$ such that for all
$\delta \in (\underline{\delta}, 1)$, $v = (v_1, \dots, v_N)$ is an
equilibrium of an infinitely repeated game in which node $i$'s average
utility is $v_i$.

Proof: Suppose there exist $\varepsilon$, $\delta$, and punishment periods
$T_i$ such that (14) holds and

$$\sum_{t=0}^{\max_i\{T_i\}-1} \max_i \max_{(\alpha,\alpha')} \big\{ v_i(\alpha) - v_i(\alpha') \big\} < \sum_{t=\max_i\{T_i\}}^{\infty} \delta^t \varepsilon, \qquad (23)$$

then the following rules of the game (Conditions I to III) achieve the
equilibrium when $\deg_{in}(i) = 2, \forall i$.

Condition I: If there is no announcement of deviating nodes:

a. If the previous stage is in the cooperating state, continue the
cooperating state.

b. If the nodes played the strategy in the previous stage that results in

$(U^{(1)}, \dots, U^{(k-1)}, U^{(k)} - \varepsilon, U^{(k+1)}, \dots, U^{(N)})$

for some $k \in \{1, \dots, N\}$, continue the previous state.

c. If the previous stage is in the punishing-node-$k$ state and the
punishment has not ended, continue the punishment. Otherwise, switch to the
strategy that results in

$(U^{(1)}, \dots, U^{(k-1)}, U^{(k)} - \varepsilon, U^{(k+1)}, \dots, U^{(N)})$.

Condition II: If node $j$ is incriminated by both of its monitors $j_1$ and
$j_2$:

a. If the previous stage's strategy is in one of the following states:
punishing node $j$; implementing
$(U^{(1)}, \dots, U^{(j)} - \varepsilon, U^{(j+1)}, \dots, U^{(N)})$;
implementing
$(U^{(1)}, \dots, U^{(k)} - \varepsilon, \dots, U^{(l)} - \varepsilon, \dots, U^{(N)})$
for some $l \neq j$; or implementing
$(U^{(1)}, \dots, U^{(l)} + \varepsilon, \dots, U^{(k)} - \varepsilon, \dots, U^{(N)})$
for some $l \neq j$; then start the punishment stage for punishing node $j$.

b. If the previous stage's strategy is in punishing node $j_1$, then switch
to the strategy that results in
$(U^{(1)}, \dots, U^{(j_2)} + \varepsilon, \dots, U^{(j)} - \varepsilon, \dots, U^{(N)})$.
A similar argument applies to increase node $j_1$'s utility by $\varepsilon$
when node $j_2$ is punished in the previous stage.

Condition III: If there is any inconsistent announcement by node $j_1$ and
node $j_2$. We note that an inconsistent announcement happens when there are
at least two announcements of a deviating node, but the deviating nodes
named in the announcements are different.

a. If the previous state is punishing node $j_1$ or node $j_2$, then restart
the punishment stage.

b. Otherwise, implement
$(U^{(1)}, \dots, U^{(j_1)} - \varepsilon, \dots, U^{(j_2)} - \varepsilon, \dots, U^{(N)})$.

The above rules cover three conditions: no announcement of a deviating node
(Condition I), consistent announcements (Condition II), and inconsistent
announcements (Condition III). Within each Condition, we specify the
strategy for each possible previous state. We note that only the nodes whose
packets are forwarded by node $j$ have the potential ability to detect the
deviation of node $j$. The above game rules ensure that if every node in the
network is monitored by at least two other nodes and there always exist
nodes that can punish the deviating node, then any $v \in V^{\dagger}$ can be
realized.
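
The rules above can be read as a small state machine driven by the two monitors' reports. The following sketch is one possible encoding, not the authors' implementation: the state tuple, the function names, and the handling of previous states that the rules do not list explicitly (for example, receiving consistent accusations while in the cooperating state) are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# (mode, punished node, punishment stages left, utility offsets per node)
# mode is "coop", "punish", or "adjust" (a utility vector shifted by +/- eps).
State = Tuple[str, Optional[int], int, Dict[int, float]]

def classify(r1: Optional[int], r2: Optional[int]) -> str:
    """Map the two monitors' reports (accused node id, or None) to a Condition."""
    if r1 is None and r2 is None:
        return "I"                      # no announcement
    return "II" if r1 == r2 else "III"  # consistent vs. inconsistent

def next_state(prev: State, j1: int, j2: int,
               r1: Optional[int], r2: Optional[int],
               eps: float, T: int) -> State:
    mode, target, left, _ = prev
    cond = classify(r1, r2)
    if cond == "I":
        if mode == "punish":
            if left > 1:                                 # I.c: punishment continues
                return ("punish", target, left - 1, {})
            return ("adjust", None, 0, {target: -eps})   # I.c: punishment is over
        # I.a / I.b: keep the state. The source states I.b for the
        # single-penalty vector; other shifted vectors are kept by assumption.
        return prev
    if cond == "II":
        j = r1                                           # both monitors accuse j
        if mode == "punish" and target == j1:            # II.b: reward j2
            return ("adjust", None, 0, {j2: +eps, j: -eps})
        if mode == "punish" and target == j2:            # II.b, symmetric case
            return ("adjust", None, 0, {j1: +eps, j: -eps})
        # II.a: start (or restart) punishing j. Treating every remaining
        # previous state this way, including cooperation, is an assumption.
        return ("punish", j, T, {})
    if mode == "punish" and target in (j1, j2):          # III.a: restart
        return ("punish", target, T, {})
    return ("adjust", None, 0, {j1: -eps, j2: -eps})     # III.b: penalize both

# Usage: monitors 2 and 3 both accuse node 1, then stay silent for two stages.
state: State = ("coop", None, 0, {})
for reports in [(1, 1), (None, None), (None, None)]:
    state = next_state(state, 2, 3, reports[0], reports[1], eps=0.1, T=2)
    print(state)
```

Keeping the classification separate from the transition makes it easy to see that Condition III also covers the case where only one monitor reports a deviation.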

If both monitors of node $j$ (node $j_1$ and node $j_2$) incriminate node
$j$, then node $j$ is punished in a way similar to the punishment in
Theorem 1. The deviator (node $j$) is punished for a certain period of time
if the previous state is one of the following: the punishing-node-$j$ state
(which implies that the punishment stage is restarted); the
finished-punishing-node-$j$ state, i.e., the state with utilities
$(U^{(1)}, \dots, U^{(j)} - \varepsilon, U^{(j+1)}, \dots, U^{(N)})$; the
state after penalizing nodes that made inconsistent announcements, i.e., the
state with utilities
$(U^{(1)}, \dots, U^{(k)} - \varepsilon, \dots, U^{(l)} - \varepsilon, \dots, U^{(N)})$,
where nodes $k$ and $l$ are the nodes that previously made inconsistent
announcements; or the state with utilities
$(U^{(1)}, \dots, U^{(l)} + \varepsilon, \dots, U^{(k)} - \varepsilon, \dots, U^{(N)})$.
In all these states, the deviator (node $j$) is punished for a certain
period of time (Condition IIa). However, if the previous state is punishing
node $j_1$, then the system switches to the strategy that results in
$(U^{(1)}, \dots, U^{(j_2)} + \varepsilon, \dots, U^{(j)} - \varepsilon, \dots, U^{(N)})$
(Condition IIb). This strategy gives node $j_2$ an additional incentive
($U^{(j_2)} + \varepsilon$) to punish node $j$. Obviously, node $j_1$ has an
incentive to announce that node $j$ deviates, since this announcement ends
node $j_1$'s punishment. Because of this possible early termination of the
punishment period, node $j_1$ also has an incentive to wrongly incriminate
node $j$; this case is prevented by Condition IIIa. Condition IIb is also
used to avoid the situation in which node $j_2$ lies in its announcement
even though it observes that node $j$ deviates. This will become clear when
we discuss Condition III.

Next, we consider the case of incompatible announcements. Incompatible
announcements imply that two nodes or two groups of nodes make different
announcements about the deviation. Either node $j$ is incriminated by only
one of the nodes (one group of nodes), or two different nodes are
incriminated by two other nodes (two other groups of nodes). When there are
incompatible announcements about node $j$ (Condition III) and the previous
state is not punishing node $j_1$ or $j_2$, the nodes that make incompatible
announcements are penalized and receive utility $U^{(j_i)} - \varepsilon$
for $i = 1, 2$ (Condition IIIb). When node $j_1$ is being punished in the
previous stage, Condition IIIa prevents node $j_1$ from falsely accusing
node $j$. Conditions IIIa and IIIb are sufficient to prevent lying in
announcements. However, including Condition IIIa creates a situation in
which node $j_2$ enjoys punishing node $j_1$. This means that when node
$j_1$ is being punished and node $j$ has really deviated, node $j_2$ has an
incentive to lie in its announcement and report that no node is deviating.
This problem is solved by Condition IIb, which gives node $j_2$ an
additional reward for telling the truth and punishing node $j$. Moreover,
(23) implies that this additional reward for node $j_2$ outweighs the
benefit from punishing node $j_1$. Thus (23) can be thought of as the
incentive for the monitoring nodes to punish the deviating node when the
announcements are inconsistent.

The previous arguments ensure that if every node in the network is monitored
by at least two other nodes, then any feasible $v \in V^{\dagger}$ can be
realized. Next, we analyze the three cases listed in Theorem 3. In the first
case, if all routes in which node $i$ participates have only 2 hops and
$\deg_{in}(i) \ge 2$, every node can be perfectly monitored by two or more
nodes, so the above game rules can be applied directly. In the second case,
when node $i$ participates in routes of which one has exactly 2 hops and
$\deg_{in}(i) \ge 3$, both the announcement from the source of the 2-hop
route and the aggregate announcements from the sources of the remaining
routes serve as the final announcements. We note that the intersection of
the aggregate announcements incriminates a certain node. The node that does
not tell the truth can be determined by majority voting. Finally, for the
case where node $i$ participates in routes that have more than 2 hops and
$\deg_{in}(i) \ge 4$, the sources can form two groups and use the previous
game rules. The lying node is again detected by majority voting. In summary,
any potential deviation in a network satisfying the conditions of Theorem 3
can be detected. Moreover, the game rules guarantee that any feasible
rational utilities can be enforced. ■

Fig. 3. Suppose the victim node, S, is at the edge of the network and every
transmission coming from node S must go through node I. Suppose node I
deviates and blocks the announcement from S. Node S can increase its
transmission power to bypass node I and broadcast the announcement.

We note that, from the announcement forwarder's perspective, there are two
scenarios: either the announcement contains negative information about the
forwarder itself, or it contains negative information about other nodes. In
the first case, the forwarding node may not forward the announcement.
However, even if that node does not forward it, there is only a small
probability that the announcement fails to propagate through the whole
network, as illustrated in Figure 3. Moreover, the condition that every node
is monitored by at least 2 nodes makes the illustrated case less probable.
In the second case, the forwarding nodes have no immediate gain from not
forwarding the announcement, i.e., the forwarder is indifferent to
forwarding it. However, the forwarding nodes are better off forwarding the
truthful announcement in order to catch and punish the deviating node;
otherwise, they may also become victims of the deviation in the future.
Moreover, an announcement consumes much less energy than the packet
transmission itself. Hence, each node is better off relaying a truthful
announcement, which costs only a small portion of the transmission energy,
than suffering a larger loss when it later falls victim to the deviating
node.

Based on different information structures, the analyses in Section III-A and
Section III-B guarantee that any individually rational utilities can be
enforced under some conditions. However, the individual distributed nodes
still need to know how to cooperate, i.e., what the good packet forwarding
probabilities are. In the next section, we describe learning algorithms that
achieve better utilities.

IV.  Self-Learning  Algorithms 

From Section III, any Pareto-dominant solution better than the one-stage NE
can be sustained. However, the analysis does not explicitly determine which
cooperation point is to be sustained. In fact, the system can be optimized
toward different cooperation points, depending on the system designer's
choices. For instance, the system can be designed to maximize the weighted
sum of the average utilities of the infinitely repeated game as follows:

$$U_{sys} = \sum_{i=1}^{N} w^{(i)} U_{\infty}^{(i)}, \quad \text{where} \quad \sum_{i=1}^{N} w^{(i)} = 1. \qquad (24)$$

In particular, when $w^{(i)} = 1/N, \forall i$, the objective is the average
utility per node, which is commonly employed in network optimization:

$$U_{sys} = \frac{1}{N} \sum_{i=1}^{N} U_{\infty}^{(i)}. \qquad (25)$$

We use (25) as an example, but we emphasize that any system objective
function can be incorporated into the learning algorithm in a similar way.
From an individual node's point of view, as long as cooperation generates a
better utility than non-cooperation, the autonomous node will participate.
Moreover, any optimization other than the system optimization can be
detected by the other nodes as a deviation, and consequently punishment can
be applied in the future.

The basic idea of the learning algorithm is to search iteratively for good
cooperative forwarding probabilities. As in the punishment design, we
consider learning schemes for different levels of information availability,
namely perfect observability and local observability. In line with the
system model in Section II, we consider time-slotted transmission that
interleaves a learning mode and a cooperation maintenance mode, as shown in
Figure 1. In the learning mode, the nodes search for better cooperation
points. In the cooperation maintenance mode, nodes monitor the actions of
other nodes and apply punishment if there is any deviation. In the learning
mode, the nodes have no incentive to deviate, since they do not know in
advance whether deviating would benefit them and they do not want to miss
the chance of obtaining better utilities. It is also worth mentioning that
if a node deviates just before a learning period, it will still be punished
in the following cooperation maintenance period. So the infinitely repeated
game assumption remains valid in this time-slotted transmission system.

A.  Self-learning  under  the  perfect  observability 

Under the perfect observability information structure, every node is able to
detect the deviation of any defecting node and to observe which nodes help
forward others' packets. This implies that every node is able to perfectly
predict the average efficiencies of the other nodes and to optimize the
cooperation point based on the system criterion (25). The basic idea of the
learning algorithm is to use steepest-descent-like iterations: all nodes
predict the average efficiencies of the others and the corresponding
gradients. The detailed algorithm is listed in Table I. Learning with
perfect observability assumes perfect knowledge of the utility functions of
all nodes in the network, and it represents the best solution that any
learning algorithm can achieve.

TABLE I
Self-Learning Repeated-Game Algorithm under Perfect Observability

For node $i$: Given $\vec{\alpha}(0)$, small increment $\beta$,
and minimum forwarding probability $\alpha_{min}$.
Iteration: $t = 1, 2, \dots$
  Calculate $\nabla U_{sys}(\vec{\alpha}(t-1))$.
  Calculate $\vec{\alpha}(t) = \vec{\alpha}(t-1) + \beta \nabla U_{sys}(\vec{\alpha}(t-1))$.
  Select $\alpha_i(t) = \min\{\max\{[\vec{\alpha}(t)]_i, \alpha_{min}\}, 1\}$.

B.  Self-learning  under  the  local  observability 

In this subsection, we focus on learning algorithms that use the information
structure available under local observability. Under this condition, the
nodes may not have complete information about the exact utilities of others.
Based on this information structure, we develop two learning algorithms. The
first is called learning through flooding. The second, which we call
learning through utility prediction, predicts the other nodes' stage
efficiencies based on the flows that pass through the predicting node.

1) Learning through Flooding: The basic idea of this learning algorithm is
as follows. Since the only information a node can observe is the effect of
changing its forwarding probability on its own utility function, the best
way for the nodes to learn the packet forwarding probability is to gradually
increase the probability and monitor whether the utility function improves.
If the utility improves, the new forwarding probability is employed;
otherwise, the old forwarding probability is kept. The algorithm lets all
nodes change their packet forwarding probabilities simultaneously, which can
be done by flooding the instruction for changing the packet forwarding
probability. After the change, its effect propagates throughout the network.
All nodes wait for a period of time until the network becomes stable, and at
the end of this period they obtain their new utilities. We note that the
increment in packet forwarding probability is proportional to the increase
in the utility function: nodes with a larger increase in utility raise their
forwarding probability more than nodes with a smaller increase. Here, we
introduce the normalization factor $U^{(i),t-1}(\alpha^{t-1})$ (the utility
before changing the forwarding probability) in order to keep the updates of
the forwarding probability bounded. The forwarding probability increment
thus depends on the small increment constant $\eta$ and the normalization
factor. The above process is repeated until no improvement can be made. The
detailed algorithm is shown in Table II.

TABLE II
Self-Learning Repeated-Game Algorithm (Flooding)

Initialization: $t = 0$; $\alpha_i^0 = \alpha_0, \forall i$. Choose small increment $\eta$.
Iteration: $t = 1, 2, \dots$
  Calculate $U^{(i),t-1}(\alpha^{t-1})$ and $U^{(i),t-1}(\alpha^{t-1} + \xi)$.
  Calculate $\Delta U^{(i),t-1} = U^{(i),t-1}(\alpha^{t-1} + \xi) - U^{(i),t-1}(\alpha^{t-1})$.
  For each $i$ such that $\Delta U^{(i),t-1} > 0$:
    $\alpha_i^t = \alpha_i^{t-1} + \eta \, \Delta U^{(i),t-1} / U^{(i),t-1}(\alpha^{t-1})$,
    $\alpha_i^t = \max(\min(\alpha_i^t, 1), \alpha_{min})$.
End when: No improvement.
  Keep monitoring the deviation.
  Start the punishment scheme if there is a deviation.

We note that the time until the network is stable is defined as the time
until no node observes fluctuations in its utility function as a result of
the flooding (the change of forwarding probabilities) in the previous round.
In practice, this waiting time can be either predefined or adjusted online
as follows. Depending on the size of the network, a waiting period is set in
each node. If a node observes that its utility function fluctuates for
longer than the preset period, that node can propose to prolong the preset
time in the next round of flooding; otherwise the old preset waiting time is
employed. When a node observes requests to prolong the waiting time, it sets
its current waiting time to the maximum of the broadcast waiting times and
its own waiting time. In this way, nodes wait until the effect of changing
the forwarding probability has propagated to the whole network before the
next flooding (change of forwarding probability) happens. A maximum delay
can also be set to keep the waiting time bounded.
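
The flooding procedure can be sketched on the same kind of toy 6-node ring. This is a hedged illustration, not the authors' code: the stage utility $(\alpha_{i-1} + \alpha_{i+1}) / (2(1+\alpha_i))$ is an assumed form, the update $\alpha_i \leftarrow \alpha_i + \eta\,\Delta U^{(i)} / U^{(i)}$ is one reading of "an increment proportional to the utility increase, normalized by the previous utility", and the waiting period is abstracted away by evaluating the utilities directly. The names `flood_learn`, `xi`, `eta`, and `tol` are hypothetical.

```python
N = 6

def stage_utility(alpha, i):
    """Assumed one-stage utility of node i (0-based index) on the ring."""
    return (alpha[(i - 1) % N] + alpha[(i + 1) % N]) / (2.0 * (1.0 + alpha[i]))

def flood_learn(alpha0=0.1, xi=0.05, eta=0.5, alpha_min=0.01,
                tol=1e-9, max_rounds=500):
    alpha = [alpha0] * N
    for _ in range(max_rounds):
        # Utilities before the change (used as the normalization factor).
        base = [stage_utility(alpha, i) for i in range(N)]
        # The flooded instruction: all nodes probe a higher probability at once.
        probe = [min(a + xi, 1.0) for a in alpha]
        trial = [stage_utility(probe, i) for i in range(N)]
        delta_u = [t - b for t, b in zip(trial, base)]
        if max(delta_u) <= tol:          # no node improves any more
            break
        # Only nodes whose utility improved move, by a normalized step.
        alpha = [
            max(min(a + eta * d / b, 1.0), alpha_min) if d > 0 else a
            for a, d, b in zip(alpha, delta_u, base)
        ]
    return alpha

alpha = flood_learn()
print([round(a, 3) for a in alpha])
print(round(stage_utility(alpha, 0), 4))
```

Starting from a strictly positive probability keeps the normalization factor nonzero; a real deployment would also need the waiting period described above before each utility is read.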

2) Learning with utility prediction: In this second approach, we observe
that some of the routing information can be used to learn the system optimal
solution (25). We assume that the routing decision has been made before the
packet forwarding task is performed. For instance, in route discovery using
the Dynamic Source Routing (DSR) algorithm [12] without route caching, the
entire selected route is included in the packet header during packet
transmission. The intermediate nodes use the route in the packet header to
determine to whom the packet should be forwarded. Therefore, the
transmitting node knows which nodes the packet passes through, the relaying
nodes know where the packet comes from and where it is heading, and the
receiving node knows where the packet comes from. The nodes use this
information to predict the utilities of other nodes. Because not all nodes
are involved in all of the flows in the network, the utility prediction may
not be perfectly accurate. However, the simulation results show that the
performance degradation is minimal, since only nearby nodes matter.

The utility prediction is illustrated using the example shown in Figure 4,
assuming $\mu_{S(r)} = 1$, $K = 1$, and $d(i,j) = 1$. We denote by
$U_i^{(j)}$ the utility of node $j$ predicted by node $i$. In the figure,
node 1 receives flows from node 3 and node 4, and node 4 receives flows from
node 1 and node 2. The flow from node 2 to node 4 is not perceived by node
1. Hence, the utilities of node 2 and node 3 predicted by node 1 are not
accurate. Similarly, the flow from node 3 to node 1 is not perceived by node
4. Therefore, $U_4^{(1)}$ and $U_4^{(3)}$ are not accurate. The accuracy of
the prediction depends on the flows. If all flows involving node $i$ pass
through node $j$, then $U_j^{(i)}$ is accurate, and vice versa, as
illustrated in Figure 4. However, as we show by simulations, the inaccuracy
in the prediction does not affect the results of the optimization much.

Fig. 4. Example for learning with utility prediction. $U_i^{(j)}$ indicates
the utility of node $j$ predicted by node $i$.

Since the objective of the optimization is to achieve the system optimal
solution (25), the best node $i$ can do is to find the solution that
maximizes the total average predicted utility function, which is

$$\max_{\{\alpha_i^{(j)}\}} \; \frac{1}{N} \sum_{j=1}^{N} U_i^{(j)}, \qquad (26)$$

$$\text{s.t.} \quad \alpha_{min} \le \alpha_i^{(j)} \le 1, \quad \forall j,$$

where $\alpha_i^{(j)}$ is the packet forwarding probability that node $j$
should employ as predicted by node $i$. The details of the algorithm are
presented in Table III. The algorithm in Table III imitates the
steepest-descent algorithm based on the predicted utility: every node finds
the gradient of the predicted utility and optimizes the predicted system
utility (26). After obtaining $\{\alpha_i^{(j)}\}$, each node sets its own
packet forwarding probability as $\alpha_i = \alpha_i^{(i)}$. We note that
the optimization problem (26) can be solved in a distributed manner, since
the optimization does not require global knowledge of the utility function.
Each node performs the optimization based on its own prediction and sets its
packet forwarding probability according to the optimized predicted average
utility.

TABLE III
Self-Learning Repeated-Game Algorithm with Utility Prediction

Initialization: $t = 0$; $\alpha_j^{(i),0} = \alpha_0, \forall i, j$. Choose small increment $\zeta$.
Iteration: $t = 1, 2, \dots$
  For each node $j = 1, \dots, N$:
    Calculate $\nabla U_j = \left[ \frac{\partial \sum_{n=1}^{N} U_j^{(n)}}{N \partial \alpha_j^{(1),t}}, \dots, \frac{\partial \sum_{n=1}^{N} U_j^{(n)}}{N \partial \alpha_j^{(N),t}} \right]$.
    Calculate $\vec{\alpha}_j^{t} = \vec{\alpha}_j^{t-1} + \zeta \nabla U_j$.
    Set $\alpha_j^{(i),t} = \max(\min(\alpha_j^{(i),t}, 1), \alpha_{min})$.
End when: No improvement; return $\alpha_j^{(i)} = \alpha_j^{(i),t}, \forall i, j$.
  Keep monitoring the deviation, and go to the punishment scheme whenever
  there is a deviation.

Finally, we discuss how to handle node mobility. The scheme works well under
moderate node mobility, when the neighbors of each node do not change very
often. Under this condition, the long-term relationship between nodes can be
established by means of the repeated game and reputation announcements as
described in Section III. As a result, cooperation can be learned and
enforced.

Obviously, the long-term relationship may be hard to establish when a node
deviates in one part of the network, moves quickly to another part of the
network, deviates again, and so on. In this case, there are two possible
solutions. First, when a node moves to a new place, some background check is
necessary before the node is allowed to transmit. If nearby nodes can share
announcements, then the new neighbors of the node can obtain the
announcements from the node's previous neighbors and thereby learn the
reputation of the new node. A real-life analogy is a job application: the
new employer asks for references from the previous employers, and both
employers can cooperate in a distributed manner. In the literature, this
idea is implemented in trust establishment for ad hoc networks, such as
[16].

The other solution is to increase the sampling rate of the learning algorithm. As long as node mobility does not change the relationship between neighboring nodes drastically, the effect of mobility on the learning algorithm can be mitigated by scheduling learning periods more frequently in the slotted transmission shown in Figure 1. This case is similar to tracking a non-stationary channel: the faster the channel changes, the more frequently the training sequence must be transmitted.

V.  Simulation  Results 

To investigate the effectiveness of our proposed framework, we perform simulations with the following settings. We generate two networks with 25 nodes: the ring-25 network and the random-25 network. The ring-25 network consists of 25 nodes that are arranged in a circle with radius 1000 m. The random-25 network consists of 25 nodes that are uniformly distributed in an area of 1000 m x 1000 m. We define the maximum distance $d_{\max}$ such that two nodes are connected if the distance between them is less than $d_{\max}$. We select the maximum distance to ensure connectivity of the whole network. In the ring-$N$ network, the angle separation between two neighboring nodes is $2\pi/N$, and the distance between two neighboring nodes is $2r\sin(\pi/N)$, where $r$ is the radius of the circle. In particular, the maximum distance for the ring-25 network can be calculated as $2000\sin(\pi/25)$ m = 250.7 m. In the random-25 network, the maximum distance between two nodes is 350 m to ensure connectivity of the whole network with a high probability.
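
The ring geometry above is easy to check numerically. The sketch below computes the neighbor distance $2r\sin(\pi/N)$ and, for a random placement, tests connectivity under a given $d_{\max}$ with a breadth-first search. The function names, the fixed random seed, and the use of Python's `random` module are illustrative assumptions, not details of the simulations reported here.

```python
import math
import random
from collections import deque


def ring_neighbor_distance(radius, n):
    """Chord length between adjacent nodes on a ring of n nodes."""
    return 2 * radius * math.sin(math.pi / n)


def random_positions(n, side, seed=0):
    """n nodes placed uniformly at random in a side x side square."""
    rng = random.Random(seed)
    return [(rng.uniform(0, side), rng.uniform(0, side)) for _ in range(n)]


def is_connected(positions, d_max):
    """Breadth-first search over the graph linking nodes closer than d_max."""
    n = len(positions)
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if v not in seen and math.dist(positions[u], positions[v]) < d_max:
                seen.add(v)
                queue.append(v)
    return len(seen) == n


d_ring = ring_neighbor_distance(1000.0, 25)   # about 250.7 m
random_25 = random_positions(25, 1000.0)
connected = is_connected(random_25, 350.0)    # depends on the sampled placement
```

Because a random placement is connected only with high probability, a simulation would typically redraw the positions until `is_connected` returns `True`.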

We also define the flows as source-destination (SD) pairs. We assume that the routing decision has been made before performing packet forwarding optimization. Shortest-path routing is employed in the simulations. In the random-25 network, we vary the number of SD pairs. When there are traffic flows from all nodes to all other nodes, we call the traffic dense flow, which implies that each node has packets destined to all the other nodes in the network. Obviously, the dense flow has $N \times (N - 1)$ SD pairs in the $N$-node network. When the total flow is less than the dense flow, the SD pairs are determined randomly. In the ring-25 network, the number of SD pairs is defined in the following way. The $K \cdot N$ SD pairs are obtained when every node $i$ sends packets to nodes $\{\mathrm{mod}(i+2, 25), \ldots, \mathrm{mod}(i+K+1, 25)\}$. For instance, 25 SD pairs are obtained when every node $i$ transmits packets to node $\mathrm{mod}(i+2, 25)$, 50 SD pairs are obtained when every node $i$ sends packets to nodes $\{\mathrm{mod}(i+2, 25), \mathrm{mod}(i+3, 25)\}$, etc. The rest of the simulation parameters are given as follows: transmission rate of source $i$, $\mu_i = 1$, $\forall i$; transmission constant $K = 1$; and distance attenuation coefficient $\gamma = 4$. We compare three learning algorithms according to the information availability. The parameters for the learning algorithms
are  listed  as  follows  (3  =  0.05,  £  =  0.001,  =  1.0,  and 

£  =  0.05.  The  minimum  forwarding  probability  is  set  to  be 
$\alpha_{\min} = 0.1$ and the maximum forwarding probability is set to be $\alpha_{\max} = 1$. Finally, all algorithms are initiated with $\alpha_i^0 = 0.5$, $\forall i$. We note that in the following simulations, we employ the average efficiency per node defined in (25) as our performance metric.
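
The SD-pair construction for the ring network and the dense flow can be sketched as follows. For simplicity the sketch numbers nodes from 0 to N-1 so that the modulo arithmetic applies directly (an assumption; the text refers to nodes 1 through 25), and the function names are illustrative.

```python
def ring_sd_pairs(n, k):
    """K*N source-destination pairs: node i sends to i+2, ..., i+k+1 (mod n)."""
    return [(i, (i + offset) % n)
            for i in range(n)
            for offset in range(2, k + 2)]


def dense_sd_pairs(n):
    """Dense flow: every node sends to every other node, N*(N-1) pairs."""
    return [(i, j) for i in range(n) for j in range(n) if i != j]


pairs_25 = ring_sd_pairs(25, 1)   # 25 pairs
pairs_50 = ring_sd_pairs(25, 2)   # 50 pairs
dense_25 = dense_sd_pairs(25)     # 600 pairs
```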

Figure 5(a) shows the average efficiency of the deviating node in the ring-25 network when the number of source-destination pairs is 75 and the discount factor is $\delta = 0.9$. In the figure, node 3 deviates at time instant 10. This deviation lowers the stage efficiencies of nodes 1, 2, and 25. From the routes, node 1, node 2, and node 25 suspect that nodes in {2,3,4}, {3,4,5}, and {1,2,3} are deviating, respectively. The nodes in the network find that node 3 is consistently incriminated for deviation and start the punishment stage (here, the punishment period is set to 3). The punishment scheme results in lower average stage efficiency, as shown in Figure 5(a). From the figure, the average efficiency without deviation is better than the average efficiency with deviation. It is clear that node 3 is better off conforming to the previously agreed cooperation point. As a result, no node wants to deviate, since deviation results in worse average efficiency. Similarly, Figure 5(b) shows the average utilities of the deviating node and the other nodes in the random network with 16 nodes and discount factor 0.9. At time instant 11, node 10 in the network deviates. At the next time instant, all related nodes that detect the deviation exchange their lists of incriminated nodes. The consistently incriminated node (in this case node 10) is punished for a certain period of time (in this figure, 8 time periods). From the figure, it is clear that node 10 will have higher average efficiency when it conforms. Thus, from Figure 5(a) and Figure 5(b), the proposed repeated game can enforce cooperation among autonomous greedy nodes.
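
The identification step in the ring example can be illustrated with a short sketch: each affected node reports the set of forwarders on its routes, and the node common to all reports is the one that is consistently incriminated. Representing the exchanged lists as Python sets and taking their intersection is an illustrative assumption about how consistency is evaluated; the names below are hypothetical.

```python
from functools import reduce

# Suspect sets reported by the nodes whose stage efficiency dropped
# (values taken from the ring-25 example with 75 SD pairs).
suspect_lists = {
    1: {2, 3, 4},
    2: {3, 4, 5},
    25: {1, 2, 3},
}


def consistently_incriminated(reports):
    """Nodes that appear in every reported suspect set."""
    return reduce(set.intersection, reports.values())


culprits = consistently_incriminated(suspect_lists)   # {3}
```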

Fig. 5. Punishment of repeated game in the ring network and the random network

Figure 6 and Figure 7 show the learning curves of the proposed self-learning repeated-game scheme for the ring-25 network and the random-25 network, respectively. In the figures, we compare the optimal solution, learning with perfect observability, learning with flooding, and learning with utility prediction. In Figure 6, all of the algorithms achieve the system optimal value when the numbers of source-destination pairs are 100, 200, and 275. Learning with perfect observability and learning with utility prediction have approximately the same convergence speed. Learning with flooding converges more slowly, since it relies on trial and error to find better forwarding probabilities. This unguided optimization requires minimal information but has inferior convergence speed. Figure 7 shows the learning curves of the proposed algorithms for the random-25 network with different numbers of source-destination pairs. One can observe that learning with utility prediction achieves an efficiency per node very close to that of the optimal solution and learning with perfect observability. In contrast, learning with flooding achieves inferior efficiency per node.

Fig. 6. Learned average efficiency per node for different traffic loads in the ring network

Fig. 7. Learned average efficiency per node for different traffic loads in the random network

Figure 8(a) shows the learned average efficiency per node for the various algorithms with different traffic flows in the ring-25 network. The efficiency becomes lower as the number of source-destination pairs becomes larger. This can be explained as follows. Because of the symmetric property of the utility functions, the local optimal forwarding probabilities for all nodes are the same. It can be easily shown that the local optimal forwarding probability in the ring-25 network is 1 for all nodes². Therefore, the larger the number of source-destination pairs, the more packets a node needs to forward and the higher the value of the denominator of the stage utility function in (7). As a result, the average efficiency per node decreases as the number of source-destination pairs increases. Using simple calculation, it can be shown that the average efficiency per node is
$$\frac{N_{sd}/N}{N_{sd}/N + 0.5\,(N_{sd}/N + 1)(N_{sd}/N)},$$
where $N_{sd}$ is the number of source-destination pairs. In Figure 8(a), all learning algorithms perform similarly for the different numbers of source-destination pairs.
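
As a sanity check on the ring-network efficiency expression, the sketch below counts, with all forwarding probabilities equal to 1, how many transmissions a node makes for its own flows and how many it makes as a forwarder, and compares the resulting ratio with the closed form $(N_{sd}/N) / (N_{sd}/N + 0.5\,(N_{sd}/N + 1)(N_{sd}/N))$. It assumes that every hop uses the same transmit power (all ring neighbors are equidistant), that packets travel hop by hop in one direction along the ring (the shortest path as long as $K + 1 \le N/2$), and that all flows have unit rate. Nodes are numbered from 0 and the function names are illustrative.

```python
from fractions import Fraction


def ring_efficiency_by_counting(n, k):
    """Efficiency of node 0 when each node i sends to i+2, ..., i+k+1 (mod n).

    Counts equal-power transmissions: own first-hop transmissions versus
    own plus forwarded transmissions.
    """
    own = 0
    forwarded = 0
    for src in range(n):
        for offset in range(2, k + 2):
            # Path src -> src+1 -> ... -> src+offset; every node on it
            # except the destination transmits the packet once.
            for hop in range(offset):
                if (src + hop) % n == 0:
                    if hop == 0:
                        own += 1
                    else:
                        forwarded += 1
    return Fraction(own, own + forwarded)


def ring_efficiency_closed_form(n_sd, n):
    """(N_sd/N) / (N_sd/N + 0.5 (N_sd/N + 1)(N_sd/N))."""
    k = Fraction(n_sd, n)
    return k / (k + Fraction(1, 2) * (k + 1) * k)


eff_75 = ring_efficiency_closed_form(75, 25)   # Fraction(1, 3)
```

Under these assumptions the expression simplifies to $2/(K + 3)$ with $K = N_{sd}/N$, which makes the decay with the number of SD pairs explicit.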

Figure 8(b) shows the achievable efficiency per node after the learning algorithms converge for different numbers of source-destination pairs in the random-25 network. We observe that learning with utility prediction achieves an efficiency very close to that of learning with perfect observability and the optimal solution. Learning with flooding achieves lower efficiency per node, but still achieves much better efficiency than the Nash equilibrium. On average, learning with utility prediction achieves around 99.2% of the efficiency achieved by the optimal solution. In contrast, learning with flooding achieves more than 73.18% of the optimal efficiency.

²This is not true in the random network in general.

Fig. 8. Average efficiency per node for different traffic loads in the ring network and the random network: (a) ring network; (b) random network. Each panel plots the average efficiency per node versus the number of SD pairs.

Comparing Figure 8(a) and Figure 8(b), we can see that learning with flooding performs well in the ring-25 network but is inferior in the random-25 network. The reason for this phenomenon is that in the ring-25 network, the utilities of all nodes are symmetric, and optimizing the system criterion (25) results in the same average efficiency at each node. Since each node in learning with flooding tries to increase its own efficiency by changing its own forwarding probability synchronously, the iteration finally reaches the point where all nodes' efficiencies are the same, due to the symmetric structure of the network. This solution coincides with the solution of the system criterion (25) optimization. In contrast to the ring-25 network, the utility functions of the nodes are highly asymmetric in the random-25 network. In this case, the node that first reaches a better solution will not change its forwarding probability, even though changing its forwarding probability would result in only slightly lower efficiency at that particular node while increasing the other nodes' efficiencies significantly.
TABLE IV

Normalized average efficiency per node for different nodes in the random network with dense traffic

Number of nodes                                   9        16       25       36       49       64       81
Average efficiency per node (Optimal solution)    0.7438   0.7581   0.5930   0.5574   0.5629   0.5316   0.4916
Normalized learning, perfect observability        99.63%   99.91%   99.39%   100%     100%     100%     99.94%
Normalized learning using flooding                84.79%   71.45%   72.81%   65.36%   68.56%   58.21%   59.40%
Normalized learning using utility prediction      100%     97.91%   98.98%   99.27%   96.59%   99.88%   96.70%

Due to this greedy and unguided optimization, learning with flooding achieves inferior average efficiency per node compared to learning using utility prediction, which obtains information from the routing information and therefore learns better.

Next, we investigate the performance of the learning algorithms in the dense flow with different numbers of nodes in the random network. Table IV shows the average efficiency per node (25) for different network sizes, normalized by the average efficiency obtained by the optimal solution. We can observe that as the number of nodes increases, the optimal average efficiency per node generally decreases. This is because the total power required for self-transmission and packet forwarding increases much faster than the successful self-transmission power as the number of nodes increases. Therefore, the stage utility of each node (7) decreases as the number of nodes increases in the dense flow. As a result, the average efficiency per node decreases as the number of nodes increases. We also observe that learning with utility prediction achieves 96% ~ 100% of the average efficiency per node achieved by the optimal solution for various sizes of the network. On the other hand, learning with flooding achieves 60% ~ 85% of the average efficiency obtained by the optimal solution. We note that learning using flooding achieves lower efficiency as the number of nodes becomes larger; this is due to the unguided optimization. As the number of nodes becomes larger, it is more probable to reach a situation where only a small portion of the nodes have high efficiency while the rest have very low efficiencies. In contrast, the performance of learning using utility prediction decreases slightly but remains very close to that of learning with perfect observability for various sizes of the network, as shown in Table IV. The decrease occurs because the utility prediction becomes less accurate as the number of nodes becomes larger.
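
The ranges quoted above can be read directly off Table IV. The short sketch below recomputes the minimum and maximum normalized efficiency of each learning scheme from the table entries; the variable and function names are illustrative.

```python
# Normalized average efficiency per node (percent of the optimal solution),
# copied from Table IV for network sizes 9, 16, 25, 36, 49, 64, and 81.
table_iv = {
    "perfect observability": [99.63, 99.91, 99.39, 100.0, 100.0, 100.0, 99.94],
    "flooding": [84.79, 71.45, 72.81, 65.36, 68.56, 58.21, 59.40],
    "utility prediction": [100.0, 97.91, 98.98, 99.27, 96.59, 99.88, 96.70],
}


def efficiency_range(values):
    """Smallest and largest normalized efficiency in a table row."""
    return min(values), max(values)


ranges = {name: efficiency_range(row) for name, row in table_iv.items()}
# ranges["flooding"] == (58.21, 84.79)
# ranges["utility prediction"] == (96.59, 100.0)
```

The flooding row spans 58.21% to 84.79%, which the text rounds to 60% ~ 85%; the utility-prediction row spans 96.59% to 100%.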

VI.  Conclusions 

In this paper, we propose a distributed mechanism for enforcing and learning cooperation points among selfish nodes in wireless networks. Our proposed scheme consists of a repeated-game framework to enforce cooperation and learning algorithms to search for better cooperation points. From the analysis and simulations, we show that our proposed framework is very effective in enforcing cooperation among greedy/selfish nodes. In practice, selfish nodes with local information may not know how to cooperate even though they are willing to do so. We propose learning algorithms to guide the distributed nodes to find better cooperation points. Depending on the information structure, the proposed learning algorithms with flooding and with utility prediction achieve 60% ~ 85% and 96% ~ 100%, respectively, of the efficiency obtained by the optimal solution with global information and centralized optimization.